Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?
Authors
Abstract
Dense multi-GPU systems have recently gained a lot of attention in the HPC arena. Traditionally, MPI runtimes have been primarily designed for clusters with a large number of nodes. However, with the advent of MPI+CUDA applications and CUDA-aware MPI runtimes like MVAPICH2 and OpenMPI, it has become important to address efficient communication schemes for such dense multi-GPU nodes. This, coupled with the new application workloads brought forward by Deep Learning frameworks like Caffe and Microsoft CNTK, poses additional design constraints due to the very large GPU-buffer messages communicated during the training phase. In this context, special-purpose libraries like NVIDIA NCCL have been proposed for GPU-based collective communication on dense GPU systems. In this paper, we propose a pipelined chain (ring) design for the MPI_Bcast collective operation along with an enhanced collective tuning framework in MVAPICH2-GDR that enables efficient intra-/inter-node multi-GPU communication. We present an in-depth performance landscape for the proposed MPI_Bcast schemes along with a comparative analysis of NVIDIA NCCL Broadcast and NCCL-based MPI_Bcast. The proposed designs for MVAPICH2-GDR enable up to 14X and 16.6X improvement over NCCL-based solutions for intra- and inter-node broadcast latency, respectively. In addition, the proposed designs provide up to 7% improvement over NCCL-based solutions for data-parallel training of the VGG network on 128 GPUs using Microsoft CNTK.
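The pipelined chain (ring) broadcast the abstract refers to can be illustrated with a short sketch. The following is a minimal, illustrative C/MPI version, not the actual MVAPICH2-GDR implementation; the chunk size, the function name, and the plain point-to-point forwarding are assumptions made for clarity. With a CUDA-aware MPI runtime such as MVAPICH2-GDR, `buf` may be a GPU device pointer.

    /* Minimal sketch of a pipelined chain (ring) broadcast: the root streams
     * the buffer in fixed-size chunks along a chain of ranks, so a rank can
     * forward chunk i to its successor while chunk i+1 is still traveling
     * the earlier links. Chunk size and rank ordering are illustrative
     * assumptions, not the tuned MVAPICH2-GDR design. */
    #include <mpi.h>
    #include <stddef.h>

    #define CHUNK_BYTES (1 << 20)  /* 1 MiB pipeline unit (assumed) */

    int pipelined_chain_bcast(void *buf, size_t bytes, int root, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        /* Logical chain position relative to the root. */
        int pos  = (rank - root + size) % size;
        int prev = (pos == 0)        ? MPI_PROC_NULL : (rank - 1 + size) % size;
        int next = (pos == size - 1) ? MPI_PROC_NULL : (rank + 1) % size;

        char  *p       = (char *)buf;
        size_t nchunks = (bytes + CHUNK_BYTES - 1) / CHUNK_BYTES;
        MPI_Request sreq = MPI_REQUEST_NULL;

        for (size_t i = 0; i < nchunks; i++) {
            size_t off = i * CHUNK_BYTES;
            int n = (int)((bytes - off < CHUNK_BYTES) ? bytes - off : CHUNK_BYTES);

            if (prev != MPI_PROC_NULL)  /* receive chunk i from predecessor */
                MPI_Recv(p + off, n, MPI_BYTE, prev, 0, comm, MPI_STATUS_IGNORE);
            MPI_Wait(&sreq, MPI_STATUS_IGNORE);  /* chunk i-1 fully forwarded */
            if (next != MPI_PROC_NULL)  /* forward chunk i down the chain */
                MPI_Isend(p + off, n, MPI_BYTE, next, 0, comm, &sreq);
        }
        MPI_Wait(&sreq, MPI_STATUS_IGNORE);
        return MPI_SUCCESS;
    }

Chunking is what makes the chain pipelined: while a rank forwards chunk i to its successor, its predecessor can already be delivering chunk i+1, so for large messages the total time approaches the single-link transfer time rather than growing linearly with the number of ranks.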
Similar resources
dMath: A Scalable Linear Algebra and Math Library for Heterogeneous GP-GPU Architectures
A new scalable parallel math library, dMath, is presented in this paper that demonstrates leading scaling when using intra-node or inter-node hybrid parallelism for deep learning. dMath provides easy-to-use distributed base primitives and a variety of domain-specific algorithms. These include matrix multiplication, convolutions, and others, allowing for rapid development of highly scalable appli...
A flexible Patch-based lattice Boltzmann parallelization approach for heterogeneous GPU-CPU clusters
Sustaining a large fraction of single GPU performance in parallel computations is considered to be the major problem of GPU-based clusters. In this article, this topic is addressed in the context of a lattice Boltzmann flow solver that is integrated in the WaLBerla software framework. We propose a multi-GPU implementation using a block-structured MPI parallelization, suitable for load balancing...
Full Bandwidth Broadcast, Reduction and Scan with Only Two Trees
We present a new, simple algorithmic idea for exploiting the potential for bidirectional communication present in many modern interconnects for the collective MPI operations broadcast, reduction and scan. Our algorithms achieve up to twice the bandwidth of most previous and commonly used algorithms. In particular, our algorithms for reduction and scan are the currently best known. Experiments o...
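For context, the factor-of-two claim above follows from a standard back-of-the-envelope bandwidth argument (my summary, not the paper's analysis): in a single pipelined binary tree every interior node must forward the full m-byte message to two children, whereas splitting the message across two trees in which each node is interior in only one halves that per-node load:

    T_one-tree(m) ≈ 2mβ + O(α log p)    vs.    T_two-trees(m) ≈ mβ + O(α log p)

where β is the per-byte transfer time, α the per-message latency, and p the number of processes.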
Deep learning with COTS HPC systems
Scaling up deep learning algorithms has been shown to lead to increased performance in benchmark tasks and to enable discovery of complex high-level features. Recent efforts to train extremely large networks (with over 1 billion parameters) have relied on cloud-like computing infrastructure and thousands of CPU cores. In this paper, we present technical details and results from our own system ba...
High Performance RDMA Based All-to-All Broadcast for InfiniBand Clusters
The all-to-all broadcast collective operation is commonly used in parallel scientific applications. This collective operation is called MPI_Allgather in the context of MPI. Contemporary MPI implementations use the Recursive Doubling and Ring algorithms for implementing this collective on top of MPI point-to-point calls. This leads to several performance bottlenecks. Depending on message size an...
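For reference, the Ring algorithm mentioned in the snippet above can be sketched in a few lines of C/MPI. This is the textbook point-to-point formulation, not the paper's RDMA-based design; the function name and byte-level block handling are illustrative assumptions.

    /* Classic ring allgather over MPI point-to-point calls. Each rank
     * contributes `blk` bytes; after size-1 steps every rank holds all
     * blocks. In step s, rank r forwards the block it received in the
     * previous step to its right neighbor. */
    #include <mpi.h>
    #include <string.h>

    int ring_allgather(const void *sendbuf, void *recvbuf, int blk, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        int left  = (rank - 1 + size) % size;
        int right = (rank + 1) % size;
        char *rb  = (char *)recvbuf;

        /* Place our own contribution at its slot in the result buffer. */
        memcpy(rb + (size_t)rank * blk, sendbuf, (size_t)blk);

        for (int s = 0; s < size - 1; s++) {
            int send_idx = (rank - s + size) % size;      /* block to forward */
            int recv_idx = (rank - s - 1 + size) % size;  /* block arriving   */
            MPI_Sendrecv(rb + (size_t)send_idx * blk, blk, MPI_BYTE, right, s,
                         rb + (size_t)recv_idx * blk, blk, MPI_BYTE, left,  s,
                         comm, MPI_STATUS_IGNORE);
        }
        return MPI_SUCCESS;
    }

Each of the p-1 steps moves one block per link in both directions, so the ring is bandwidth-efficient for large messages but pays p-1 latency terms, which is why recursive doubling is typically preferred for small messages.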
Journal: CoRR
Volume: abs/1707.09414
Pages: -
Publication date: 2017